VOICE PREPROCESSOR FOR DIGITAL VOICE APPLICATIONS


INTRODUCTION 

For many years, digital voice processors have been used to transmit speech information at low
bit rates in secure voice applications. Digital voice processors are increasingly being used for
recognizing speech or speakers to facilitate human-computer interaction (Fig. 1). In any of these
applications, an indispensable part of the process is the characterization of the speech spectrum.
Recently, numerous digital speech processing techniques have been developed for this purpose, with
linear predictive coding (LPC) being the most widely used. We have also investigated LPC
analysis/synthesis for improving low-bit-rate voice encoding [1,2].


Fig. 1 — Digital voice processors for secure voice and human-computer interactive applications. The
speech preprocessor (indicated by a thick-lined box) conditions the speech signal for subsequent
speech analysis for various applications. The quality of the speech preprocessor significantly
affects the overall system performance. During the early days of 2400-b/s voice encoder development,
refinement of the front-end analog circuits alone improved the speech intelligibility by as much as
five points, which demonstrates the importance of the input audio processor.

Strangely, no comprehensive investigation has been reported on the requirements of a speech
preprocessor (often known as the front-end processor or speech I/O), which is the critical link
between humans and computers. Over the years, we have designed audio circuits based on somewhat
lax performance specifications because severely distorted speech can still provide acceptable speech
quality. For example, the speech waveform from a carbon microphone is severely distorted by
the random modulation of the electrical resistance caused by the movement of the carbon granules,
yet the quality of telephone speech is deemed acceptable by a majority of its users. Because our
ears are tolerant of speech distortions, speech input circuits have often been haphazardly designed.

But the speech analyzer in the digital voice processor is not a human ear. A speech waveform
anomaly not objectionable to the human ear (e.g., peak clipping of the speech waveform) can cause a
significant deterioration in the estimated speech spectrum. The speech waveform has a wide dynamic
range (60 dB or more), and it is difficult to maintain a correct speech level. An improper speech level
is one of the reasons why a voice processor optimized in the laboratory by the use of carefully
prerecorded speech often fails in the field because of the varying levels of live speech.

The  speech  preprocessor  presented  in  this  report  is  more  than  a  conventional  front-end  speech 
I/O  device  comprised  of  an  amplifier,  an  antialiasing  filter,  and  an  analog-to-digital  (A/D)  converter. 
Our  preprocessor  self-adjusts  the  speech  level  and  equalizes  the  microphone  frequency  response  and 
the  spectral  tilt  of  voices;  it  also  removes  various  interferences  detrimental  to  speech  analysis,  such  as 
distorted  microphone  response,  breath  noise  from  the  microphone,  digital  noise  in  the  analog  channel, 
60  Hz  hum,  unintentional  DC  bias  from  the  A/D  converter,  and  external  ambient  noise.  In  other 
words,  the  speech  preprocessor  conditions  the  speech  signal  to  produce  the  best  speech  analysis  result 
for  the  intended  applications  (i.e.,  speech  encoding,  speech  recognition,  or  speaker  recognition). 

Note that many preprocessing operations will be digital rather than analog because of the
following advantages:

•  Miniaturization—  Elaborate  analog  circuits  are  a  hindrance  to  miniaturizing  voice  processors. 
Over  the  years,  weight  and  power  consumption  of  digital  voice  processors  have  declined 
steadily  (Fig.  2),  and  this  trend  will  continue.  Our  approach  to  speech  preprocessing  lends 
itself  to  future  hardware  miniaturization. 

•  No  Aging  Problem— The  performance  will  not  degrade  because  of  aging  components  in  the 
analog  circuits. 

•  Flexibility and Power— Digital processing has more flexibility. For example, a digital filter
can achieve more nearly ideal filtering characteristics (i.e., steeper cutoff rate and linear phase
response), and they can be altered more conveniently by changing weights. If needed, filter
weights can be changed adaptively based on whether the speech signal is voiced (which needs
sharp cutoff) or unvoiced (which does not need sharp cutoff to bring out-of-band speech
energies into the passband).

The topics included in this report have come to our attention while working with various voice
processors over the past 15 years. We hope that our thoughts and experience will guide the designers
of future secure voice and human-computer interactive voice systems.

Fig. 2 — Power consumption and weight of low-bit-rate digital voice processors. This figure shows
how the advancement of digital component technology has contributed to the reduction of both
weight and power consumption. Sig Sally was packaged in a dozen 6 ft racks; KY-9 was contained
in seven 19-in. racks (weighing 500 lb). In our preprocessor, the operations heretofore carried out
by analog circuits are relegated to digital signal processing. Reduction of analog processing has
been an influential factor in the miniaturization of voice processors.

BACKGROUND  DISCUSSIONS 
Wide  Dynamic  Range 

Speech  is  a  difficult  signal  to  interface  with  a  signal  processor  because  speech  has  a  wide 
dynamic  range.  Peak  amplitudes  of  vowels  are  often  40  dB  greater  than  peak  amplitudes  of  fricatives 
(Fig.  3).  In  addition,  a  20  dB  difference  in  loudness  exists  from  one  speaker  to  another.  Therefore,  a 
front-end  processor  must  have  a  dynamic  range  of  at  least  60  dB.  Otherwise,  vowel  waveforms  will 
be  clipped  frequently,  or  weak  fricatives  will  be  lost.  If  any  of  these  occur,  the  performance  of  the 
voice  processor  will  be  degraded. 


Fig. 3 — Speech waveform of "help." The peak amplitude of speech varies as much as 40 dB within a
fraction of a second, as noted from the fricative /h/ to the vowel /e/ in this example. A wide dynamic
range is a notable characteristic of the speech waveform. An improper front-end gain will distort the
speech waveform. One critical function of the processor is to maintain a proper input speech level.

Perceptual Tolerance to Speech Distortion

But distorted speech is not too objectionable to the human ear. It has long been known that
so-called "peak clipping" of speech has little perceptual effect on intelligibility and negligible
effect on quality, even if 10 to 12 dB of the highest voice peaks are eliminated. During World War II,
the U.S. Army Signal Corps Engineering Laboratories tasked the Psycho-Acoustic Laboratory of Harvard
University to investigate the maximum degree of amplitude distortion tolerable in a communication
system (i.e., analog communication systems such as the telephone). It was found that if speech were
differentiated prior to clipping, even infinite clipping retained 90% to 95% of the original
intelligibility of nonsense monosyllables [3]. For two reasons, human ears are insensitive to
amplitude distortions:


•  Harmonic Structure of Voiced Speech Spectrum— Voiced speech (vowel sounds) is generated
by periodic ringing of the vocal tract by the glottis. Therefore, the voiced speech waveform is
periodic at the pitch rate (Fig. 4); its spectrum is concentrated at pitch harmonics (Fig. 4).
Note that amplitude distortions of voiced speech do not create cross products of frequencies
that fall between pitch harmonics (or inharmonic sounds). Hence distorted speech is not too
objectionable to our ears.

Fig. 4 — Speech waveform of vowel /e/ and its frequency spectrum. Because the speech
waveform is repetitive at the pitch rate, its frequency components are concentrated at pitch
harmonics. Thus, cross products of frequencies generated by nonlinear distortions are also
concentrated at pitch harmonics. That is why distorted speech from carbon microphones is
not too objectionable to human ears, whereas music (whose spectrum is entirely different
from the voiced speech spectrum) sounds terrible over the telephone.

•  Variability of Unvoiced Speech Spectrum— Unvoiced speech is created by turbulent air through
a constriction somewhere in the vocal tract. Since the time waveform is random, its spectrum
is also random (Fig. 5). For a given unvoiced speech sound, its spectrum varies widely from
speaker to speaker because each has different lip, tongue, and teeth clearances. Distorted
unvoiced speech of one speaker can sound like undistorted unvoiced speech of another person.
This is why we do not perceive distorted unvoiced speech as being objectionable.


Fig. 5 — Speech waveform of unvoiced speech /s/ and its frequency spectrum. Unlike the
voiced speech spectrum, the unvoiced speech spectrum is random. A distorted unvoiced
spectrum of one person may be similar to an undistorted unvoiced spectrum of another
person. That is why distortions of unvoiced speech are not objectionable to our ears.


Perceptual  Tolerance  to  Stationary  Phase  Shift 

In addition, our perception is insensitive to certain kinds of phase distortions. For example, a
stationary phase shift of the speech spectrum is not discernible to us. To illustrate this phenomenon,
the speech waveform is passed through an all-pass filter whose amplitude response is flat,

$$ |H(f)| = 1, \tag{1} $$

and whose phase response is a quadratic function of frequency (i.e., the group delay is a linear
function of frequency),

$$ \theta(f) = 6\pi \left( \frac{f}{4000} \right)^{2} \ \text{radians}, \tag{2} $$

where $f$ is in hertz, and half the sampling frequency is 4000 Hz. Although the input and output
speech waveforms look different (Fig. 6), they sound exactly alike. They must be heard to be
believed!
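
The all-pass operation of Eqs. (1) and (2) can be reproduced numerically. The following is a minimal sketch in Python, not the implementation used for this report: it applies the quadratic phase to one frame of a synthetic harmonic signal by multiplying its discrete Fourier transform by a unit-magnitude factor. The 125 Hz pitch, the 256-sample frame, the sign convention exp(-jθ), and the circular (frame-periodic) filtering are all assumptions made for illustration.

```python
import cmath
import math

FS = 8000.0   # sampling rate in Hz; half of it is the 4000 Hz of Eq. (2)
N = 256       # frame length (assumed for this illustration)


def dft(x):
    """Direct O(N^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def theta(freq_hz):
    """Quadratic phase response of Eq. (2), in radians."""
    return 6.0 * math.pi * (freq_hz / 4000.0) ** 2


def allpass(x):
    """Apply |H(f)| = 1 with phase -theta(f); negative-frequency bins get
    the conjugate factor so that the output stays real."""
    n = len(x)
    spectrum = dft(x)
    shifted = []
    for k in range(n):
        if k <= n // 2:
            h = cmath.exp(-1j * theta(k * FS / n))
        else:
            h = cmath.exp(-1j * theta((n - k) * FS / n)).conjugate()
        shifted.append(spectrum[k] * h)
    return [v.real for v in idft(shifted)]


# Synthetic "voiced" frame: 19 harmonics of an assumed 125 Hz pitch.
x = [sum(math.cos(2.0 * math.pi * 125.0 * h * t / FS) / h for h in range(1, 20))
     for t in range(N)]
y = allpass(x)

energy_in = sum(v * v for v in x)
energy_out = sum(v * v for v in y)
largest_sample_change = max(abs(a - b) for a, b in zip(x, y))
print(energy_in, energy_out, largest_sample_change)
```

Because every spectral component keeps its magnitude, the energy of `y` matches that of `x`, yet the two waveforms differ sample by sample.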

Fig. 6 — Input and output speech waveforms for the all-pass filter. As noted, the output speech
waveform is distorted, but both the input and output speech waveforms sound exactly alike. Our
hearing is blind to time-invariant phase shift.


Effects  of  Speech  Distortion  on  Spectral  Estimation 

It is significant to point out, however, that what is acceptable to human perception is very
different from what is acceptable to the digital speech processor, which tries to estimate the speech
spectrum by a limited number of parameters. For example, peak clipping of speech is highly
detrimental to speech analysis, such as the LPC analysis.

The LPC analysis is based on the assumption that a given speech sample $x_i$ is predicted by a
weighted sum of past samples,

$$ x_i = \sum_{k=1}^{n} a_k x_{i-k} + \varepsilon_i, \tag{3} $$

in which $\varepsilon_i$ is the prediction error and the set of weighting factors $a_k$ is estimated
by minimizing the mean-square prediction error. The prediction principle does not hold well when the
speech waveform is clipped. As a result, the estimated speech spectrum becomes erroneous. Figure 7
illustrates the effect of waveform clipping on the LPC spectrum.
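
The sensitivity of the prediction weights to clipping can be checked with a small experiment. The sketch below is illustrative only and rests on several assumptions not stated in this report: it uses the autocorrelation method with the Levinson-Durbin recursion to minimize the mean-square prediction error of Eq. (3), a hypothetical second-order resonance (500 Hz, pole radius 0.95, 8 kHz sampling) as the test signal, and a hard clipper at an arbitrary level of 0.5.

```python
import math


def autocorrelation(x, max_lag):
    """R(0..max_lag) of the finite sequence x."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the weights a_1..a_order of Eq. (3).
    Returns (weights, residual_energy)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        reflection = acc / err
        updated = a[:]
        updated[m] = reflection
        for k in range(1, m):
            updated[k] = a[k] - reflection * a[m - k]
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:], err


def lpc(x, order):
    return levinson_durbin(autocorrelation(x, order), order)


def clip(x, limit):
    """Hard peak clipping at +/- limit."""
    return [max(-limit, min(limit, v)) for v in x]


# Hypothetical test signal: impulse response of one resonance, so the
# true weights (A1, A2) are known exactly.
POLE_RADIUS = 0.95
OMEGA = 2.0 * math.pi * 500.0 / 8000.0
A1 = 2.0 * POLE_RADIUS * math.cos(OMEGA)
A2 = -POLE_RADIUS ** 2

x = []
for t in range(400):
    prev1 = x[t - 1] if t >= 1 else 0.0
    prev2 = x[t - 2] if t >= 2 else 0.0
    x.append(A1 * prev1 + A2 * prev2 + (1.0 if t == 0 else 0.0))

clean_weights, _ = lpc(x, 2)
clipped_weights, _ = lpc(clip(x, 0.5), 2)
print("true   ", A1, A2)
print("clean  ", clean_weights)
print("clipped", clipped_weights)
```

For the unclipped signal the recursion recovers the true weights; after clipping, the estimated weights (and therefore the implied spectral envelope) move away from them.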

CRITICAL  DESIGN  ISSUES 

The critical issues related to the design of a voice preprocessor include microphone frequency
response equalization, input coupling, automatic gain adjustment, digital implementation of the
antialiasing filter, and reduction of various forms of interference (Fig. 8). Each item is discussed
in a subsequent section.

1.  Microphone 

Noise-cancelling microphones are used in all military platforms. The noise-cancelling microphone
attenuates undesirable background acoustic noise, and it also attenuates unintentional information
(including human voices in the background) leaking into the microphone.

Fig. 8 — A preprocessor for voice processing applications, comprising an analog processor and a
digital processor. The preprocessor automatically adjusts the speech level, removes speech
interference, and digitizes the speech signal with a nearly ideal antialiasing filter. The topics of
discussion are indicated in shaded boxes.


The noise-cancelling microphones currently in use are the first-order gradient microphones that
were developed in the 1930s by Harry Olson [3]. The output of this type of noise-cancelling
microphone is proportional to the pressure difference between two closely spaced elements. The output
of a noise-cancelling microphone caused by a sinusoidal source is expressed by

$$ \Delta P = P_m \left[ \frac{\sin\!\left( \frac{2\pi}{\lambda}(ct - r) \right)}{r^{2}} + \frac{2\pi \cos\!\left( \frac{2\pi}{\lambda}(ct - r) \right)}{\lambda r} \right] D \cos\theta, \tag{4} $$

where $r$ is the distance to the sound source, $P_m$ is a proportional constant, $D$ is the separation
of the two microphone elements, $\lambda$ is the wavelength of the sound source, $c$ is the speed of
sound, and $\theta$ is the signal arrival angle measured from the axial direction [3].

The near-field response (which has $r^2$ in the denominator in Eq. (4)) is for speech, and it is
independent of frequency. The far-field response (which has $r$ in the denominator) is for noise, and
it is directly proportional to frequency; the frequency response has a -6 dB/octave attenuation
characteristic toward low frequencies. This ideal frequency response, however, is seldom attained in
practice because of the complex mechanical structure around microphone elements, and no two
microphones from differing manufacturers have similar frequency responses.
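
The frequency behavior of the two terms in Eq. (4) can be verified with a few lines of arithmetic. The sketch below evaluates the peak magnitude of each term for on-axis arrival; the speed of sound (343 m/s), the 2 m source distance, and the 1 cm element spacing are assumed values chosen only for illustration, and the function name is hypothetical.

```python
import math


def gradient_mic_terms(freq_hz, r_m, d_m, c=343.0, p_m=1.0):
    """Peak magnitudes of the near-field (1/r^2) and far-field (1/r) terms
    of Eq. (4) for on-axis arrival (cos(theta) = 1)."""
    wavenumber = 2.0 * math.pi * freq_hz / c      # equals 2*pi / lambda
    near = p_m * d_m / r_m ** 2
    far = p_m * d_m * wavenumber / r_m
    return near, far


def to_db(ratio):
    return 20.0 * math.log10(ratio)


# Compare the terms one octave apart (500 Hz and 1000 Hz).
near_lo, far_lo = gradient_mic_terms(500.0, r_m=2.0, d_m=0.01)
near_hi, far_hi = gradient_mic_terms(1000.0, r_m=2.0, d_m=0.01)

far_slope_db_per_octave = to_db(far_hi / far_lo)
near_slope_db_per_octave = to_db(near_hi / near_lo)
print(near_slope_db_per_octave, far_slope_db_per_octave)
```

The near-field term does not change with frequency, whereas the far-field term gains about 6 dB for each doubling of frequency, which is the same as a 6 dB/octave attenuation toward low frequencies.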


In certain operating environments, the speech signal may come from existing intercommunication
systems or audio chains. In this case, voice processors (either voice encoders, speech recognizers,
or speaker recognizers) must work with the existing microphone. Thus it is worthwhile to review
some of the better known noise-cancelling microphones to assess the amount of frequency equalization
each microphone needs. We show typical mouth-to-microphone sensitivities to indicate
the degree of speech level fluctuations expected if the microphone is not held properly. We point out
also that the puff screen is essential for noise-cancelling microphones because they tend to distort
the onsets of plosives.

1.1 Frequency Response Equalization

NRL has surveyed the existing microphones, audio systems, and ambient noise characteristics in
various tactical platforms [4]. This program was tailored to assess how the Advanced Narrowband
Digital Voice Terminal (ANDVT) would perform with acoustic noise at the input and stress, vibration,
and acceleration applied to the talker. To do this, both trained and "walk on" speakers read
Diagnostic Rhyme Test (DRT) words and Diagnostic Acceptability Measure (DAM) sentences from
military platforms while engaging in realistic maneuvers. Through this program, a vast amount of
data related to microphones and audio systems was collected.

Most of the presently deployed noise-cancelling microphones were originally designed for analog
voice communication systems where a flat frequency response was not essential. In digital voice
processors, however, the presence of microphone response peaks adversely affects the estimation of the
speech spectrum. Not all presently deployed noise-cancelling microphones have a flat frequency
response, as shown in Fig. 9. They must be equalized to have a flat response.

Microphones for tracked vehicles (e.g., the M-87 and M-138) were originally designed to
attenuate low frequencies to filter out mechanical rumbles. For speech analysis, however, a lack of
low frequencies is detrimental: (a) pitch tracking and voicing decisions become less reliable, and (b)
recognition of nasals (/m/, /n/, or /ng/) becomes difficult because they have mainly low-frequency
components. Thus our recommendations are:

•  The frequency response should be restored to the ideal flat response from 150 to 3800 Hz.

•  More effective digital preprocessing should be used to eliminate noise.

In Section 5 we present a noise suppression method that equalizes microphone frequency
response. In this method, the noise suppression is carried out in the frequency domain. Therefore, the
microphone response equalization can be effected conveniently as an integral part of the noise
suppression at a small computational cost.

1.2 Mouth-to-Microphone Sensitivity

The induced speech level of a noise-cancelling microphone is significantly dependent on the
mouth-to-microphone distance. An improperly held microphone is a major cause of speech-level
fluctuations that are highly detrimental to speech analysis. Figure 10 shows the amount of speech
attenuation expected when the microphone is moved 1/4 to 1 in. from the mouth. The average magnitude
of speech attenuation is somewhere around 12 dB, which is rather significant, considering that the
microphone could be easily moved by 3/4 in., even while trying to hold the microphone steadily. Our
recommendation is that the preprocessor shall have a self-adjusting amplifier gain. In Section 3 we
present an effective software-controlled automatic gain control (AGC) mechanism.

Fig. 9(a) — Frequency response of the TA-840. The TA-840 handset is widely deployed for naval
communication in shipboard environments. Several different noise-cancelling microphone elements have
been developed for the TA-840. This particular microphone element produces bass-heavy and
not-too-intelligible speech sounds. In addition, bass-boosted speech can generate acoustic feedback
between the microphone and intercom speaker. As noted, the frequency response is far from flat;
it should be equalized before performing digital speech processing. The method of equalizing the
frequency response is given in Section 5.


Fig. 9(b) — Frequency response of the M-87 noise-cancelling microphone. The M-87 is a boom microphone
that has been widely used by the Navy and Air Force in airborne and tracked vehicles. Low frequencies
are severely rolled off (a -4 dB gain at approximately 500 Hz) to attenuate mechanical rumbling.
According to tests, the M-87 outperformed the TA-840 when used as an LPC front-end microphone. As in
the TA-840, the frequency response of the M-87 should be equalized prior to speech processing.


Fig. 9(c) — Frequency response of the M-138 noise-cancelling microphone. The M-138 is a boom
microphone and has been used interchangeably with the M-87. Since the speech waveform will be filtered
at 4 kHz, a peak around 4.4 kHz is inconsequential to the performance of the voice processor. The
M-138 has a better frequency response than either the M-87 or TA-840. But it did not work as well as
the M-87 at tank platforms where low-frequency noise is predominant. The reason is that the M-87
severely attenuated low frequencies, whereas the M-138 microphone actually boosts low frequencies
that should be equalized.

Fig. 9(d) — Frequency response of the M 42 noise-cancelling microphone. The M 42 is a handheld
microphone often used in the PH' platform. The frequency response below 2 kHz is nearly ideal. A 7 dB
peak near 3 > kHz should be equalized. According to tests, the M 42 was one of the better microphones
for the LPC front end.

1.3 Breath Noise

The leading edge of a sound wave is often spattered on impact at the surface of the microphone,
creating a burst of noise at the speech onset, particularly at onsets of plosives such as /p/. This
phenomenon is more pronounced with the noise-cancelling microphone because, as discussed in
connection with Eq. (4), the far-field frequency response is similar to a high-pass filter (i.e., a
differentiator). The spectrum of /p/ normally has predominantly low-frequency components, but it
spreads noticeably after spattering (Fig. 11). The resultant spectrum resembles the spectrum of /t/.


Fig. 11 — Onset spectra of /p/ (in "pee," "po," and "pu") with and without a puff screen:
(a) microphone with shield, (b) microphone without shield. Plosive sounds without a puff screen in
the microphone tend to sound like /t/.

The narrowband LPC tends to accentuate these distorted plosive sounds. They are frequently
heard as pops, and the voicing decision tends to be "voiced" rather than "unvoiced" as it should be.
Not only is the intelligibility of the plosive sound itself reduced, but conflicting burst and vowel
transition information (such as a /t/-like burst followed by formant movements typical of /p/) can be
confusing to the listener. For the same reasons, the speech recognizer will be confused.

We  recommend  that  a  puff  screen  be  used  in  all  noise-cancelling  microphones.  According  to  our 
measurement,  the  use  of  a  puff  screen  does  not  alter  frequency  response  characteristics,  although  the 
speech  level  could  be  reduced  by  1  or  2  dB  across  the  entire  passband. 

2.  Input  Coupling 

The audio is often routed from a subscriber terminal in a communication center through a cable
to the voice processor, frequently by way of a switchboard. These circuits typically are balanced
600-Ω audio lines (usually grounded center tap), although in some installations they may be
unbalanced (one side grounded). Therefore, the I/O should be designed to satisfy both cases to
prevent hum pickup, low level, and loss of low frequencies.

To avoid improper input coupling, many systems have a floating input provided by an input
transformer (far more protective from transients and RF than a differential input operational
amplifier) that works equally well for both balanced and unbalanced inputs. If the transformer
coupling is used, there is no need to further attenuate low frequencies by the antialiasing filter
because the transformer inherently attenuates low frequencies.

An important specification of the input transformer is the low-frequency cutoff because the
transformer size is more or less determined by the lowest frequency to be transmitted at maximum
level. We recommend a low-frequency cutoff of 150 Hz. A number of different off-the-shelf
transformers are marketed for use with multitone MODEMS and are acceptable for voice applications.

3.  Automatic  Gain  Control 

A most difficult requirement for the audio input processor is to maintain a proper speech level
prior to the digital voice processor. As discussed in the Background section, a small amount of peak
clipping of the speech waveform, caused by a mismatched gain, can cause serious consequences to the
estimated speech spectrum.

Good reasons for using a reliable gain control mechanism for voice processors operating in
tactical environments are:

•  Improper handling of microphone— In tactical platforms, noise-cancelling microphones are
routinely used to reduce background noise. As discussed previously, the speech level of a
noise-cancelling microphone is highly dependent on the mouth-to-microphone distance. An
optimum mouth-to-microphone distance is 1/4 in., but when it is increased to only 1 in. by
careless handling, the speech level decreases anywhere from 11 to 14 dB (Fig. 10).

•  Shouting— In military environments, shouting is not unusual because of excessive background
noise or tense operating conditions. The speech level easily jumps 10 to 20 dB by shouting.

•  Operating with existing audio systems— In certain operating environments, the voice processor
may be connected to existing intercom or audio systems. They may have different gain levels
from one platform to another. An external gain mismatch could be a serious problem for
achieving a proper input level.

The audio front end could be equipped with a manual gain control to compensate for the external
gain mismatch. Manual gain controls have reportedly not worked well, however, because the
operators in the field often did not know how to adjust them properly. Thus it is desirable to have an
AGC at the front end to self-adjust the gain in accordance with the input speech level.

Not all AGC devices, however, are suitable for digital voice processors. For example, a
fast-attack-and-slow-release AGC is not appropriate for the frame-by-frame spectral analysis used in
the digital voice processor. Amplitude variations within the analysis frame caused by the AGC induce
errors in the estimated spectrum.

3.1 Recommended AGC

In our approach, the necessary gain is estimated by digital computations. The estimated gain is
then fed back to the analog amplifier of the front end of the digital voice processor. The gain is
incremented or decremented depending on the error signal: the difference between the reference level
and the quantity derived from the speech (which will be defined shortly). We update the gain during
unvoiced or silent periods only. In this way the speech amplitude is not altered during voiced speech
segments. Figure 12 shows the gain estimation processor.

Fig. 12 — Software-controlled AGC. The low-band energy is processed from voiced speech (whose
amplitude is as much as 40 dB greater than unvoiced speech). The first moment of the low-band energy
($M_i$) is compared with the reference level. The reference level is so chosen that when $M_i$ equals
this level, there will be no amplitude clipping. The front-end gain ($g_i$) is updated during unvoiced
or silent periods.


The choice of input variable is critical to the AGC performance. We chose the low-band energy
contained below 1 kHz as the input variable (i.e., the first formant amplitude) because it is relatively
independent of the nature of speech. Although low-band energy is being averaged over a short time
period (20 ms), it has some fluctuations caused by leakage of higher formant frequency components
and/or acoustic background noise. Thus we further smoothed the low-band energy through statistical
averaging. Time averaging, while simpler, is not as good as ensemble averaging because the amplitude
of low-band energy is not uniformly distributed (if so, time averaging would be equivalent to
statistical averaging).

To facilitate computations, the incoming low-band speech energy is quantized to one of 25
values, approximately ±20 dB around the reference level (Table 1). We chose a fixed step size of
1.75 dB because we can discern a loudness change of that magnitude. Thus the quantized low-band
speech energy is

$$ y_i = F(x_i), \tag{5} $$

where $x_i$ and $y_i$ are low-band speech energies before and after quantization, respectively, and
$F(\cdot)$ denotes the quantization rule listed in Table 1.


Table 1 — Quantization of Speech Low-Band Energy Based on 12-bit Representation of the
Speech Waveform. The quantization step size is 1.75 dB. The reference level is 250 or Step 13.

Low-Band Speech Energy ($x_i$)    Quantized Low-Band Energy ($y_i$)
22 or less                         1
27                                 2
33                                 3
41                                 4
50                                 5
61                                 6
75                                 7
92                                 8
112                                9
137                               10
167                               11
205                               12
250 (Reference)                   13
306                               14
374                               15
458                               16
560                               17
685                               18
837                               19
1024                              20
1253                              21
1534                              22
1870                              23
2298                              24
2813 or more                      25
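
The quantization rule of Table 1 can be approximated in closed form. The sketch below is one possible reading, not the rule as implemented in the report: it assumes that the 1.75 dB step is an amplitude-type step (20 log10, which matches the ratio 306/250 in the table) and that an input between two table entries maps to the nearest level in decibels. The function names are hypothetical.

```python
import math

STEP_DB = 1.75        # quantization step size from Table 1
REF_ENERGY = 250.0    # low-band energy at the reference level
REF_LEVEL = 13        # quantized value of the reference level
NUM_LEVELS = 25


def quantize_low_band_energy(x):
    """Map low-band energy x to a level 1..25 (assumed nearest-level rule)."""
    if x <= 0.0:
        return 1
    steps = round(20.0 * math.log10(x / REF_ENERGY) / STEP_DB)
    return max(1, min(NUM_LEVELS, REF_LEVEL + steps))


def level_to_energy(y):
    """Approximate low-band energy at the center of level y."""
    return REF_ENERGY * 10.0 ** ((y - REF_LEVEL) * STEP_DB / 20.0)


for energy in (10, 22, 250, 306, 1024, 2813, 4000):
    print(energy, quantize_low_band_energy(energy))
```

Under these assumptions the computed level centers agree with the tabulated energies to within a few units, with saturation at levels 1 and 25.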

To compute the probability density function of the quantized low-band energy, one register is
assigned to each quantization level. When the quantized low-band energy is equal to a particular
register index, the content of that register is incremented by one. The contents of all registers are
then short-term averaged by a single-pole filter with a feedback constant of 1/32. Thus,

$$ C_i(Y) = C_{i-1}(Y) + \frac{\Delta_i(Y) - C_{i-1}(Y)}{32}, \tag{6} $$

where $C_i(Y)$ is the content of the register associated with quantization level $Y$ during the $i$th
voiced frame. The incremental content $\Delta_i(Y)$ is expressed by

$$ \Delta_i(Y) = \begin{cases} 1, & \text{if } y_i = Y \\ 0, & \text{otherwise.} \end{cases} \tag{7} $$

In Eq. (6), the feedback constant of 1/32 defines the width of an exponentially decaying window
for the register. According to tests with a real-time simulator using a variety of speech samples
(noisy as well as quiet), a feedback constant of 1/32 is a suitable choice in terms of fast gain
settling without introducing undesirable hunting in the steady state. Note that the feedback constant
1/32 does not directly control the gain update rate; the gain update rate is set by the incremental
gain $\Delta g_i$, which appears in Eq. (11).

With updated register counts, the probability density function of the low-band speech energy is computed by

$$P_i(Y) = \frac{C_i(Y)}{\sum_{Y'=1}^{25} C_i(Y')}. \tag{8}$$


The error is defined as the difference between the reference level (REF) and the mean value of the low-band energy. Thus,

$$\epsilon_i = \mathrm{REF} - \sum_{Y=1}^{25} Y\,P_i(Y). \tag{9}$$

As indicated in Table 1, the reference level is 13 (which corresponds to a low-band energy level of 250 units in a 12-bit A/D conversion). When the mean value of low-band energy equals the reference level, there is no amplitude clipping of any vowels.
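The register update, the probability estimate, and the error computation of Eqs. (6) through (9) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the report's assembly-language implementation: the registers are assumed to start at zero, and the error is taken as zero until at least one voiced frame has been seen. The function names are hypothetical.

```python
NUM_LEVELS = 25        # quantization levels of Table 1
REF_LEVEL = 13         # reference level (250 units)
FEEDBACK = 1.0 / 32.0  # single-pole feedback constant of Eq. (6)


def update_registers(registers, y):
    """Eqs. (6)-(7): leaky update of every register for one voiced frame.

    `registers` holds C(Y) for Y = 1..25; `y` is the quantized energy level.
    """
    return [c + ((1.0 if level == y else 0.0) - c) * FEEDBACK
            for level, c in enumerate(registers, start=1)]


def mean_level_error(registers):
    """Eqs. (8)-(9): REF minus the mean level under the estimated density."""
    total = sum(registers)
    if total == 0.0:
        return 0.0  # assumption: no statistics yet, so no correction
    mean = sum(level * c
               for level, c in enumerate(registers, start=1)) / total
    return REF_LEVEL - mean
```

If every voiced frame falls on level 9, the estimated mean converges to 9 and the error to 13 - 9 = 4 levels, which is outside the two-level dead zone and therefore calls for a gain increase.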

The front-end analog amplifier gain in decibels, denoted by $g_i$, is incrementally adjusted by

$$g_i = g_{i-1} + \Delta g_i, \tag{10}$$

where the incremental gain $\Delta g_i$ in decibels is nonlinearly related to the error:

$$\Delta g_i = \begin{cases} 0, & \text{if } |\epsilon_i| \le 2, \\ \ldots, & \text{if } \epsilon_i < -2, \\ \ldots, & \text{if } \epsilon_i > 2. \end{cases} \tag{11}$$

The transfer characteristic has a dead zone near the reference level and is linear elsewhere. Thus, if the estimated mean of the low-band energy is within two quantization levels (3.5 dB) of the reference level, no gain adjustment is made. There is a broad range of acceptable update factors. We chose the factor 1/32 after experimenting with various types of speech input, including noisy speech and lengthy two-way casual conversations over a real-time processor. Our decision was based on both transient and steady-state performance, in particular on the gain settling time from an initial gain mismatch as large as -28 dB.

3.2 Prototype Performance

The AGC function described in Section 3.1 has been tested in the NRL-owned programmable voice processor and achieved the following results.

•  The AGC established the necessary gain based on past speech statistics. Therefore no additional frame delay was introduced.

•  With the use of assembly language, the computation time was 0.55 ms for voiced frames and 0.015 ms for unvoiced frames (one frame is 22.5 ms or 180 speech-sampling time intervals).

•  With an external gain mismatch from 0 to -28 dB, intelligibility was virtually unaltered.

•  When the input gain was mismatched by as much as -28 dB initially, the steady-state gain was reached within 2 s after the initial onset of voiced speech.

•  Once the steady state was reached, there was no noticeable hunting. This observation was based on a 30-min recording of two-way conversations of various speakers.

•  No gain pumping was observed in the presence of severe background noise (helicopter noise).

This AGC unit was field tested by using ANDVT over HF channels (Fig. 13). The received speech was recorded 500 mi away. Transcriptions of the recorded voice indicate that the AGC worked satisfactorily for various voices casually speaking in conversational and text-reading modes. Figure 14 is a segment of the speech spectrum of a live message recorded at the receiver.

4.  Analog-to-Digital  Conversion 

In the conventional front-end processor, the analog speech signal is passed through an analog antialiasing filter that sharply attenuates frequencies above 4 kHz (Fig. 15), and the filtered output is digitized at a rate of 8 kHz. In our approach, however, the speech signal is sampled at a rate of 16 kHz, and the necessary filtering is carried out digitally (Fig. 15). According to our experience, there is no need for an 8 kHz low-pass filter prior to A/D conversion because no significant speech energy exists beyond 8 kHz.


Fig. 13 - HF test of AGC with a secure voice terminal. The AGC was installed in the ANDVT. The 2400-b/s speech was transmitted over the upper sideband of an HF channel from a U.S. Navy ship to a shore station 500 mi away. At the same time, another voice terminal with a manual gain control transmitted the same voice over the lower sideband of the same HF channel. Transcriptions of both speeches at the receiver indicate that the ANDVT with the AGC provided a consistently better matched speech level than the voice terminal with a manual gain control. This is an example of where a manual control is not useful because the operator in the field does not know how to adjust it properly. The use of an AGC at the preprocessor is recommended.

(b) Our approach

Fig. 15 - Old and new approaches to A/D conversion. In our approach, the necessary filtering is effected by digital computations, which can more readily attain ideal filtering characteristics, such as a sharp cutoff rate and a linear phase response over the entire passband, than can an analog filter (see Fig. 16).

4.1 Digital Antialiasing Filter

A typical antialiasing filter has a cutoff characteristic on the order of -180 dB per octave. For example, a 4 kHz antialiasing filter has roll-off characteristics of 0 dB at 3.6 kHz, -20 dB at 4 kHz, -40 dB at 4.4 kHz, ..., -180 dB at 7.2 kHz. The impulse response of an antialiasing filter may be obtained from the following Hamming-windowed Fourier series:

$$h(j) = \begin{cases} G\left[0.54 - 0.46\cos\dfrac{2\pi j}{J+1}\right]\left[0.5 + \displaystyle\sum_{n=1}^{N}\cos\left(n\pi\left(\dfrac{j}{J+1} - 0.5\right)\right)\right], & \text{for } 0 < j < J+1, \\ 0, & \text{otherwise,} \end{cases} \tag{12}$$

where the factor $G$ makes the sum of the impulse response unity (i.e., a DC gain of unity). The quantity $J$ is the total number of impulse response samples and is related to the attenuation rate beyond the cutoff frequency. The quantity $N$ is related to the cutoff frequency for a given value of $J$. The impulse response is symmetric with respect to the midpoint. Thus the phase response is linear.

A 4 kHz low-pass filter with a frequency roll-off rate of approximately 180 dB per octave may be realized by letting J = 43 and N = 22 in Eq. (12). On the other hand, a 6 kHz low-pass filter with a similar frequency roll-off characteristic may be realized by letting J = 43 and N = 33. The impulse responses of these filters are listed in Table 2, and their frequency responses are shown in Fig. 16.
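The filter design can be checked numerically. The sketch below assumes one reading of Eq. (12): a Hamming window of period J + 1 multiplied by the truncated cosine series 0.5 + Σ cos(nπ(j/(J + 1) - 0.5)) for n = 1 to N, normalized to unity DC gain. This reading is an interpretation of the printed formula, chosen because it reproduces the Table 2 coefficients closely. The function name `antialias_fir` is hypothetical.

```python
import math


def antialias_fir(J=43, N=22):
    """Hamming-windowed Fourier-series low-pass filter.

    Returns the taps h(1)..h(J), scaled so that they sum to one.
    With a 16 kHz sampling rate, N = 22 gives a 4 kHz cutoff and
    N = 33 gives a 6 kHz cutoff (J = 43 in both cases).
    """
    taps = []
    for j in range(1, J + 1):
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * j / (J + 1))
        series = 0.5 + sum(math.cos(n * math.pi * (j / (J + 1) - 0.5))
                           for n in range(1, N + 1))
        taps.append(window * series)
    total = sum(taps)  # the factor G of Eq. (12) is 1 / total
    return [t / total for t in taps]


h4 = antialias_fir(43, 22)  # 4 kHz filter for speech encoding
h6 = antialias_fir(43, 33)  # 6 kHz filter for recognition
```

Under this reading the center tap `h4[21]` (index j = 22) comes out near the 0.51046 listed in Table 2, and the taps are symmetric about the midpoint, which is what gives the linear phase response.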

4.2 Downsampler

If a 4 kHz antialiasing filter is used, the filter output is downsampled by a factor of two to one; thus every other sample is skipped.

On the other hand, if a 6 kHz antialiasing filter is used, the filter output is downsampled by a factor of four to three. In other words, every four consecutive samples produce three consecutive samples. These three consecutive samples are obtained by interpolation. Thus

[Eq. (13), which gives y(1), y(2), and y(3) as interpolated combinations of x(1) through x(4), is illegible in the source.]

where x(1), x(2), x(3), and x(4) are four consecutive input samples, and y(1), y(2), and y(3) are three consecutive output samples from the downsampler.
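The two downsampling ratios can be illustrated as follows. The interpolation weights of Eq. (13) cannot be read from the source, so the four-to-three case below assumes plain linear interpolation between neighboring input samples. That choice is an assumption made for illustration and may differ from the report's formula. Both function names are hypothetical.

```python
def downsample_2_to_1(samples):
    """Keep every other sample (16 kHz in, 8 kHz out)."""
    return list(samples[::2])


def downsample_4_to_3(samples):
    """Produce three outputs per four inputs (16 kHz in, 12 kHz out).

    Assumed method: linear interpolation. Output k is taken at input
    position 4k/3, so each group of four inputs yields outputs at
    offsets 0, 4/3, and 8/3 within the group.
    """
    out = []
    k = 0
    while True:
        idx, rem = divmod(4 * k, 3)
        last_needed = idx if rem == 0 else idx + 1
        if last_needed >= len(samples):
            break
        if rem == 0:
            out.append(float(samples[idx]))
        else:
            frac = rem / 3.0
            out.append((1.0 - frac) * samples[idx]
                       + frac * samples[idx + 1])
        k += 1
    return out
```

A linear ramp is a convenient check: interpolating the ramp 0, 1, 2, ... should return the fractional positions 0, 4/3, 8/3, 4, and so on.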

5.  Speech  Signal  Conditioning 

Four types of input interference are often encountered in the speech waveform. They are an unintentional DC bias generated by the A/D converter, 60 Hz hum, digital noise picked up by the analog circuit, and ambient acoustic noise. We discuss methods for suppressing these interferences.

Table 2 - Impulse Responses of 4 and 6 kHz Antialiasing Filters. The 4 kHz filter is for speech encoding; the 6 kHz filter is for speech or speaker recognition.

                    Impulse Response h(j)
  Index (j)         4 kHz cutoff         6 kHz cutoff
                    J = 43, N = 22       J = 43, N = 33
  1 and 43           0.00103             -0.00005
  2 and 42           0.00112             -0.00112
  3 and 41          -0.00171              0.00219
  4 and 40          -0.00174             -0.00232
  5 and 39           0.00314              0.00068
  6 and 38           0.00271              0.00271
  7 and 37          -0.00557             -0.00629
  8 and 36          -0.00396              0.00727
  9 and 35           0.00930             -0.00329
  10 and 34          0.00538             -0.00540
  11 and 33         -0.01459              0.01482
  12 and 32         -0.00687             -0.01846
  13 and 31          0.02282              0.01079
  14 and 30          0.00829              0.00831
  15 and 29         -0.03505             -0.03119
  16 and 28         -0.00954              0.04397
  17 and 27          0.05581             -0.03242
  18 and 26          0.01052             -0.01054
  19 and 25         -0.10113              0.07938
  20 and 24         -0.01113             -0.15601
  21 and 23          0.31613              0.21606
  22                 0.51046              0.76181

Fig. 16 - Frequency responses of 4 and 6 kHz antialiasing filters realized by digital filtering (horizontal axis: frequency in kHz). Advantages of using digital antialiasing filters are (1) in-band frequency ripples are negligibly small, (2) frequency roll-off rates are steep (steeper than 180 dB per octave), (3) there are no return gains such as are often observed in analog filters, and (4) phase responses are linear functions of frequency.

5.1 DC Bias Removal

DC bias is often generated within the A/D converter because of component deterioration in the output balance circuit. Table 3 lists the A/D converter output of our recently acquired signal processor. The magnitude of DC bias is disturbingly large. We cannot ignore the DC offset in the 12-bit A/D converter when its magnitude is as large as four bits (Table 3(a)). DC bias is probably generated after equipment has been deployed and people notice the degraded voice processor performance. Thus the voice preprocessor must be capable of removing DC bias.

Table 3 - 12-Bit A/D Converter Output Samples (with the input grounded), (a) without DC suppression, (b) with DC suppression

In the LPC analysis, a speech sample is represented by a weighted sum of past samples (see Eq. (…)). In matrix notation, this relationship is expressed by Eq. (14), and the prediction coefficients that minimize the mean-square error are given by Eq. (15).

[Eqs. (14) and (15) are illegible in the source.]

In Eq. (15), $(X^T X)$ is the autocorrelation matrix of the input speech samples. When speech amplitude is low and the DC bias is large, the autocorrelation matrix in Eq. (15) tends to have row elements with similar numerical values, making inversion of the matrix impossible. (The situation is similar to finding intersections of parallel lines.) Such an event creates a number of undesirable effects:

•  The initial consonant intelligibility is reduced, particularly for /b/, /d/, and /n/, which are difficult to characterize even when DC bias is absent.

•  The speech synthesizer tends to generate annoying pops or flutters when speech is absent.

The DC offset present in the A/D converter output may be removed by a simple DC-suppression filter made of a zero at z = 1 and a pole at z = α, where α < 1 (Fig. 17). Thus the transfer function of the DC suppression filter is expressed by

$$H_1(z) = \frac{1+\alpha}{2}\,\frac{1 - z^{-1}}{1 - \alpha z^{-1}}, \tag{16}$$

where factor α is related to the 3 dB cutoff frequency (Table 4). The factor (1 + α)/2 in Eq. (16) makes the passband gain unity (i.e., $H_1(z) = 1$ at z = -1). Since this factor is nearly unity, it may be replaced with 1.0 to save computation time; the consequence is that the speech amplitude becomes a fraction of a decibel higher. Figure 18 shows the frequency response of this DC-suppression filter.
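A filter with a zero at z = 1, a pole at z = α, and a gain factor of (1 + α)/2 corresponds to the difference equation y(i) = αy(i - 1) + ((1 + α)/2)[x(i) - x(i - 1)]. The sketch below implements that recursion and evaluates its frequency response. It assumes an 8 kHz sampling rate (the rate after two-to-one downsampling), which is consistent with the cutoff frequencies of Table 4. The function names are hypothetical.

```python
import cmath
import math


def dc_suppress(samples, alpha=0.885):
    """First-order DC-suppression filter: zero at z = 1, pole at z = alpha."""
    gain = (1.0 + alpha) / 2.0  # makes the gain unity at z = -1
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = alpha * prev_y + gain * (x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def dc_filter_gain(freq_hz, alpha=0.885, fs=8000.0):
    """Magnitude of the transfer function at freq_hz (fs assumed 8 kHz)."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs)  # z^-1 on the unit circle
    return abs((1.0 + alpha) / 2.0 * (1.0 - z1) / (1.0 - alpha * z1))


def cutoff_hz(alpha, fs=8000.0):
    """-3 dB frequency, from cos(w) = 2*alpha / (1 + alpha**2)."""
    return fs / math.pi * math.atan((1.0 - alpha) / (1.0 + alpha))
```

With these definitions, `cutoff_hz(0.885)` is close to the 156 Hz entry of Table 4, and a constant input decays toward zero at the output, as expected of a DC suppressor.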


5.2 60 Hz Hum Reduction

Often, faulty input coupling causes 60 Hz hum pickup. The presence of 60 Hz noise could be a serious problem for the digital voice processor. The design of a 60 Hz suppression filter is similar to that of the DC suppression filter presented in the preceding section. The zero and pole of the 60 Hz suppression filter have an argument that corresponds to ±60 Hz (i.e., ±(60/4000)π = ±0.015π radians), as illustrated in Fig. 19. Thus the transfer function of the 60 Hz suppression filter is

Fig. 17 - Zero and pole of the DC suppression filter. The zero located at z = 1 is indicated by •; the pole located at z = α (α < 1) is indicated by ■. The magnitude of α controls the cutoff frequency (see Table 4). Compare this figure with Fig. 19 (60 Hz suppression filter).


Table 4 - Cutoff Frequency in Terms of Filter Parameter α. Any α values between 0.875 and 0.925 are acceptable.

  Filter Parameter α    -3 dB Cutoff Frequency
  0.855                 200 Hz
  0.865                 185
  0.875                 170
  0.885                 156
  0.895                 142
  0.905                 128
  0.915                 114
  0.925                 100

Fig. 18 - Frequency response of the DC suppressor (gain in dB vs frequency in kHz). The response rises smoothly from 0 Hz, and no in-band frequency ripples occur that could be detrimental to the LPC analysis. This figure is plotted for α = 0.885, and the -3 dB cutoff frequency is 156 Hz (see Table 4).

Fig. 19 - Zero and pole of the 60 Hz suppression filter. The zero is indicated by ○; the pole is indicated by ■. Compare this figure with Fig. 17 (DC suppression filter).

where α is the parameter that controls the notch bandwidth around 60 Hz, and G is a gain factor that makes the passband gain unity (i.e., $H_2(z) = 1$ at z = -1).

Equation (17) may be simplified as

$$H_2(z) = G\,\frac{1 - 1.99777988\,z^{-1} + z^{-2}}{1 - 2\alpha(0.9988899)\,z^{-1} + \alpha^2 z^{-2}}, \tag{18}$$

where

$$G = \frac{(1 + \alpha^2) + 2\alpha(0.9988899)}{3.99777988}. \tag{19}$$


Although the DC suppression filter has a wide range of acceptable values of α (see Table 4), the 60 Hz suppression filter has only a limited range of values of α because the frequency response must rise sharply beyond 60 Hz. We recommend α = 0.94, and the transfer function of the 60 Hz suppression filter becomes

$$H_2(z) = 0.9409005\,\frac{1 - 1.99777988\,z^{-1} + z^{-2}}{1 - 1.877913\,z^{-1} + 0.8836\,z^{-2}}. \tag{20}$$


Figure  20  shows  the  frequency  response  of  this  60  Hz  suppression  filter. 
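The coefficients of the 60 Hz suppression filter follow from the notch frequency, the sampling rate, and α. The sketch below recomputes them, assuming an 8 kHz sampling rate so that 60 Hz corresponds to 0.015π radians, and evaluates the response at a few frequencies. The function names are hypothetical.

```python
import cmath
import math


def notch_coefficients(alpha=0.94, f0=60.0, fs=8000.0):
    """Return (G, numerator, denominator) of the second-order notch.

    The zeros lie on the unit circle at +/- f0; the poles lie at radius
    alpha on the same angles. G normalizes the gain to unity at z = -1.
    """
    c = math.cos(2.0 * math.pi * f0 / fs)  # about 0.9988899 for 60 Hz
    num = [1.0, -2.0 * c, 1.0]
    den = [1.0, -2.0 * alpha * c, alpha * alpha]
    gain = (1.0 + alpha * alpha + 2.0 * alpha * c) / (2.0 + 2.0 * c)
    return gain, num, den


def notch_response(freq_hz, alpha=0.94, f0=60.0, fs=8000.0):
    """Complex frequency response of the notch at freq_hz."""
    gain, num, den = notch_coefficients(alpha, f0, fs)
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs)  # z^-1 on the unit circle
    n = num[0] + num[1] * z1 + num[2] * z1 * z1
    d = den[0] + den[1] * z1 + den[2] * z1 * z1
    return gain * n / d
```

For α = 0.94 the computed gain factor agrees with the value 0.9409005 quoted in the text, and the denominator coefficients agree with 1.877913 and 0.8836.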


Fig. 20 - Frequency response of a simple 60 Hz notch filter (gain in dB vs frequency in kHz). Often, a faulty input coupling introduces 60 Hz noise. As noted, this 60 Hz suppression filter also attenuates the DC component by 17.27 dB. Thus this 60 Hz suppression filter can be used as a general-purpose, low-frequency cutoff filter.

5.3 Digital Noise Reduction

The presence of noise (such as digital noise pickup by the analog circuits) limits the available dynamic range. The digital noise pickup can be controlled by isolating the analog circuits from the digital circuits and by filtering power lines if a common power supply is used. We recommend that the magnitude of digital noise be within one or two least significant bits for the 12-bit A/D converter output. This criterion is not difficult to meet. We recommend that all analog circuits be placed in a separate copper can to minimize digital noise pickup.

5.4  Ambient  Noise  Reduction 

When speech is transmitted digitally at low bit rates, speech intelligibility is degraded by as much as 15 to 25 points because of pitch and voicing errors and the inability of the filter coefficients to describe accurately the complex spectra of noisy speech. Likewise, a voice recognizer that is 82% accurate at 85 dB sound pressure level (SPL) (i.e., office environments) scores far lower (the percentage is illegible in the source) at 115 dB SPL (i.e., helicopter platforms) [4]. Since voice processors often operate in noisy military platforms, reduction of ambient noise is a significant objective of preprocessing.

Spectral Noise Subtraction Method

We tested a spectral subtraction method, one of a family of frequency-domain noise-reduction techniques: it subtracts the estimated short-term amplitude spectrum of noise from the short-term amplitude spectrum of noisy speech. The resultant spectrum is converted to the speech signal by using the original input phase spectrum (implying that no steps are taken to refine the phase spectrum). This technique has been investigated extensively by Lim [5], Berouti et al. [6], Boll [7], Weiss et al. [8], and others. They all perceived that the output speech was improved by incorporating a number of artifices, including the oversubtraction of spectra and the amplitude transformation of the individual spectrum. Thus the estimated speech spectrum is often denoted by the general form

$$|S(k)|^{\alpha} = |Y(k)|^{\alpha} - \gamma\,|N(k)|^{\alpha}, \qquad k = 0, 1, 2, \ldots, 127, \tag{21}$$

where $|N(k)|$ is the estimated kth amplitude spectral component of noise that must be updated during the absence of speech; $|Y(k)|$ is the kth amplitude spectral component of the noise suppressor input; and $|S(k)|$ is the estimated kth amplitude spectral component of speech (i.e., noise suppressor output). In Eq. (21), α ≥ 1 and γ ≥ 1. Berouti, Schwartz, and Makhoul [6] used α = 2 with an adjustable γ; Boll [7] used γ = 1 and α = 1; and Weiss et al. [8] used, in effect, γ = 1 with an adjustable α. We used a fixed set of parameters (i.e., γ = 1 and α = 2). Thus,

$$|S(k)|^2 = |Y(k)|^2 - |N(k)|^2, \qquad k = 0, 1, 2, \ldots, 127. \tag{22}$$

Equalization of Microphone Response

We introduced equalization of the microphone frequency response in Eq. (22) by weighting the speech spectrum by the differential gain between the actual microphone response and the ideal flat response. The speech spectrum with equalized microphone response is expressed by

$$|S(k)|^2 = W(k)\left[\,|Y(k)|^2 - |N(k)|^2\,\right], \qquad k = 0, 1, 2, \ldots, 127, \tag{23}$$

where $W(k)$ is the kth differential gain (in power ratio, not in decibels) between the actual microphone response and the ideal flat response for the kth frequency. Note that the index k is incremented by a frequency step of (4000/128) = 31.25 Hz. We need not compensate the microphone response outside 150 and 3800 Hz. Thus, $W(k) = 1$ if k < 5 or k > 122.

Additional Factors

Even if the values of γ and α are fixed (i.e., γ = 1 and α = 2), the noise-suppression performance is dependent on other salient factors not explicitly shown in Eq. (23). These factors are:

1. Spectral analysis - The 180 speech samples of the current frame were overlapped with the 76 trailing samples of the previous frame through trapezoidal windowing. We chose an overlap of 76 samples because the resulting 256 samples permit the use of a standard FFT for the time-to-frequency transformation.

2. Minimum spectral floor - If the subtracted spectrum in Eq. (23) (i.e., the left-hand member) was less than zero, it was replaced with zero because the amplitude spectrum cannot be negative. But, as noted by Berouti, Schwartz, and Makhoul [6], a small amount of spectral floor improved the output speech quality. They used a minimum spectral floor of somewhere between -20 and -46 dB with respect to the estimated noise spectrum. In our DAM tests (where α = 2), we used a fixed value of -34 dB. Thus

$$|S(k)|^2 \ge 0.02\,|N(k)|^2, \qquad k = 0, 1, 2, \ldots, 127. \tag{24}$$

3. Noise spectrum updating period - In the noise-suppression technique that makes use of a single microphone, the noise spectrum is available only when speech is absent. Therefore, the noise spectrum should be updated only in the absence of both voiced speech and unvoiced speech. In high-noise environments, however, unvoiced speech is difficult to detect because ambient noise is often much louder. Voiced speech has considerable energy at the first formant frequency, and most platform noises do not have strong resonant frequencies in the first formant region. As illustrated in Fig. 21, the histogram of low-band energy swings between larger values (when voiced speech is present) and smaller values (when voiced speech is absent). During each frame, we obtained the low-band energy of the current frame by simply summing the first 32 spectral power densities available for spectral subtraction. The past history of low-band energy was scanned to determine the maximum (MAX) and minimum (MIN) values. We updated the noise spectrum when the current low-band energy (P) was below a threshold set at one-eighth of the distance from MIN to MAX:

$$P < \mathrm{MIN} + (\mathrm{MAX} - \mathrm{MIN})/8. \tag{25}$$

Although we updated the noise spectrum during unvoiced frames, the effect was not too adverse because unvoiced speech was generally brief in comparison to the silent periods between phrases.

Fig. 21 - Histogram of speech energy below 1 kHz contained in the trapezoidally amplitude-weighted 256 samples (i.e., short-term averaged, low-band energy). The two sentences were spoken at a helicopter platform where the noise level was as high as 115 dB. When voiced speech is absent, the level of low-band energy is close to the minimum value (MIN) observed in the past 10-s history. The noise spectrum is updated when the low-band energy of the present frame (P) is below the threshold level that is indicated by the heavy line. As noted, the noise spectrum is updated during long gaps between sentences and brief gaps (a few frames) between words.

4. Noise spectrum adaptation - The first-order low-pass filtering given by

$$|N(k)|^2 \leftarrow G\,|N(k)|^2 + (1 - G)\,|Y(k)|^2, \qquad k = 0, 1, 2, \ldots, 127, \tag{26}$$

is adequate for updating the noise spectrum. In Eq. (26), G is a feedback factor that is normally G = 15/16. A quicker update is preferred when the input spectral density is less than the estimated noise spectral density. In this case, we used G = 3/4. This noise-spectrum adaptation method proved adequate. We observed that a suddenly appearing interfering tone during speech was effectively cancelled within a quarter of a second.
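The per-frame logic of the noise suppressor (power subtraction with a spectral floor, the low-band energy test for noise-only frames, and the noise spectrum adaptation) can be sketched as follows. This is a minimal illustration operating on 128-point power spectra that are assumed to be already computed; the FFT, windowing, and microphone equalization are omitted. Two details are assumptions: the floor of 0.02 is applied to the power spectrum as Eq. (24) is printed, and the choice between G = 15/16 and G = 3/4 is made separately for each spectral component. All function names are hypothetical.

```python
def subtract_noise(input_power, noise_power, floor=0.02):
    """Eqs. (22) and (24): power subtraction with a minimum spectral floor."""
    return [max(y - n, floor * n)
            for y, n in zip(input_power, noise_power)]


def low_band_energy(input_power, bins=32):
    """Sum of the first 32 power densities (0 to 1 kHz at 31.25 Hz steps)."""
    return sum(input_power[:bins])


def is_noise_frame(energy, history):
    """Eq. (25): P below MIN + (MAX - MIN)/8 over the stored history."""
    lo, hi = min(history), max(history)
    return energy < lo + (hi - lo) / 8.0


def adapt_noise(noise_power, input_power, slow=15.0 / 16.0, fast=3.0 / 4.0):
    """Eq. (26): first-order update, faster when the input is below the estimate.

    Assumption: the slow/fast decision is made per spectral component.
    """
    updated = []
    for n, y in zip(noise_power, input_power):
        g = fast if y < n else slow
        updated.append(g * n + (1.0 - g) * y)
    return updated
```

In use, each frame's power spectrum would first be tested with `is_noise_frame(low_band_energy(frame), history)`; if the test passes, the noise estimate is refreshed with `adapt_noise` before `subtract_noise` is applied.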

Prototype  Performance 

We selected 11 different types of noisy speech samples actually recorded at military platforms and an office. Figure 22 is an example of spectrograms before and after noise reduction when we used the speech samples recorded in a P3C cruising at a high altitude (the noise level is 105 dB). The noise suppressor reduced noise by 15 dB, and the output spectrum is remarkably free of ambient noise.

Fig. 22 - Spectrograms of noisy speech recorded at a P3C platform: (a) noise-suppressor input, (b) noise-suppressor output. The turboprop noise generated by the P3C is rather stationary with two prominent resonant frequencies. The noise suppressor removed most of the prop noise.

We also evaluated the prototype noise suppressor through the 2.4-kb/s speech encoder. The performance was evaluated by the standardized speech quality test, the Diagnostic Acceptability Measure (DAM). It evaluates the amount of speech distortion in terms of hissing, buzzing, rumbling, babbling, fluttering, muffled, nasal, unnatural, cracking, thin, and harsh qualities. Hence the test would indicate the effectiveness of noise suppression when the voice processor is tested with and without noise suppression. The following is a summary of test results:


1. Average quality score improvement - The voice quality improved for every speech material we tested. The average score improved 6 points, which is substantial. In the shipboard environment where the noise level was only 76 dB, the score was improved by only 2.6 points (see Fig. 23).

2. No adverse effects with noise-free speech - Spectral subtraction did not degrade the quality of noise-free speech, contrary to some of the previously tested noise-suppression techniques. In fact, spectral subtraction removed even the background hiss produced by the original analog tape. As a result, speech quality improved from 50.2 to 52.7.

3. Least improvement with nonstationary noise - As expected, a noise-subtraction method making use of a single microphone did not perform well with nonstationary noise (such as the fluttering wind noise encountered in a moving jeep) because a slowly updated noise spectrum cannot compensate effectively for the rapidly changing incoming noise spectrum.

4. Dramatic improvement with relatively stationary noise - If the noise is relatively stationary, as is the P3C noise, the performance of the spectral subtraction method is remarkable. The score improved from 32.0 to 45.2, which is comparable to the quality improvement in a noise-free environment when random bit errors are reduced from 5% to 0.5%.

6.  Spoken  Voice 


The spectral characterization of speech improves if the talker pronounces words slowly and distinctly. The use of a delayed sidetone is an effective way to induce the talker to articulate more slowly and conscientiously. Likewise, if the speech spectrum is balanced between low and upper frequency bands, the result of speech analysis improves. We present a method of equalizing the speech spectral envelope.

Fig. 23 - Speech quality scores with and without the spectral subtraction method for 11 recording conditions (quiet, office, ABCP, ship, E3A, helicopter, helicopter carrier, jeep, P3C, tank, and destroyer). Speech quality was improved in all cases, and the speech quality was upgraded from "very poor" to "poor," "poor" to "fair," etc. These descriptive terms were devised by the Digital Voice Processor Consortium, which has been testing voice processors since 1972.

6.1 Sidetone Considerations

Sidetone is an acoustic feedback of the speaker's own voice to the earphone of the handset used for transmission. In the full-duplex telephone, sidetone is superimposed with the received signal from the other end, including line noise and the speaker's voice.

Sidetone at the speaker's site provides many benefits. Richards [9] found that the relative loudness of sidetone influences how loud a person talks. The absence of any sidetone indicates to the talker that the line is dead. What is more important, the quality of sidetone gives the talkers some idea of the quality of the connection, which influences their manner of talking. For example, a very noisy line (as evidenced by a noisy sidetone) usually encourages the talker to speak louder. Likewise, a line with echo may influence the talker to speak more slowly and distinctly.

Black [10] found that delayed auditory feedback causes a slowdown in the talking rate in proportion to sidetone delays of up to 200 to 250 ms. Too much delay, however, causes articulation disturbances, which have been extensively studied and documented [11]. NRL also conducted delayed sidetone experiments that confirmed the previous findings; namely, talking slows down with increasing sidetone delays of up to 100 ms (Fig. 24) [12].


Fig. 24 - Reading time vs sidetone delay (ms). In this experiment, a list of 12 sentences and four 24-word lists (one list each of one-, two-, three-, and four-syllable words) were read while hearing one's own voice delayed in the earphone of the handset [12]. The mean value of reading time with no delay is about 21 s, and reading time increases linearly to 26 s with a delay of 100 ms.

Previously, NRL specified a delay of 30 ms for ANDVT, a tactical 2400-b/s secure telephone
for tri-service use (mentioned in Section 3.2). A delay of 30 ms is relatively small, and it will not
affect communication. (The Bell System does not use echo suppression if echo delay is 30 ms or
less.) The usefulness of sidetone in actual environments will be further evaluated from user reactions
to ANDVT units as they are deployed in quantity in the near future.
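
To give a feel for what a 30 ms or 100 ms sidetone delay means in samples, the delayed-sidetone condition can be simulated with a simple delay line. This is only an illustrative sketch: the function name `delayed_sidetone` and the 8 kHz sampling rate are assumptions, not values taken from the report.

```python
from collections import deque


def delayed_sidetone(samples, delay_ms, sample_rate_hz=8000):
    """Return the input delayed by delay_ms, with silence (zeros) at the start.

    sample_rate_hz = 8000 is an assumed sampling rate, used only for illustration.
    """
    delay_samples = round(delay_ms * sample_rate_hz / 1000)
    line = deque([0.0] * delay_samples)  # delay line pre-filled with silence
    out = []
    for s in samples:
        line.append(s)              # newest microphone sample enters the line
        out.append(line.popleft())  # oldest sample goes to the earphone
    return out
```

Under the assumed 8 kHz rate, a 30 ms delay corresponds to 240 samples and a 100 ms delay to 800 samples.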

6.2  Speech  Spectral  Tilt  Equalization 

For a given speech sound, the amount of speech spectral tilt varies significantly from person to
person (Fig. 25). A clear ringing voice has more high-frequency energy (Fig. 25(a)) because of the
following favorable characteristics of the glottis and vocal tract: (a) the glottis closes instantly
(i.e., wideband excitation), (b) the glottis closes completely (i.e., a good "on-and-off" contrast),
and (c) the vocal tract is not lossy (i.e., no speech leakage from the nasal passages). Other voices
have weak upper bands (Fig. 25(b)) because their glottis and vocal tract characteristics are the
opposite of these. A speaker recognition process directly or indirectly exploits the spectral tilt to
identify or verify speakers.

For other speech applications, however, a wide variation in the spectral tilt results in
speaker-dependent performance because LPC analysis does not work well with speech signals having
weak upper frequency components. Therefore, LPC analysis is often preceded by preemphasis
(high-frequency boost). Usually, a fixed preemphasis is used. Since the magnitude of the spectral tilt
differs from person to person, an adaptive preemphasis is preferable, whereby the amount of
high-frequency boost is controlled by the amount of spectral tilt of the input speech.

Adaptive preemphasis is accomplished by a single-zero filter with an adaptive filter weight:

$$ y(i) = x(i) - \beta\, x(i-1), \tag{27} $$

where $\beta$ is the adaptive preemphasis factor, and $x(i)$ and $y(i)$ are the input and output speech
samples. We chose $\beta$ to be the coefficient of the first-order linear predictor because it
approximates the speech envelope by a single variable, and this variable contains mainly information
regarding the spectral tilt. Thus,




[Figure 25: two panels, (a) Speaker No. 1 and (b) Speaker No. 2, each plotting amplitude spectrum (dB) vs frequency (kHz)]

Fig. 25 - Speech spectra of the vowel a in "way" from two different talkers: (a) a clear
and ringing voice that is not easily drowned out by ambient noise (a good voice for cocktail
parties); (b) a typical aging voice that lacks high-frequency energy. LPC analysis
disfavors a speech spectrum that is heavily tilted. Thus, LPC analysis is usually preceded
by a preemphasis (high-frequency boosting); it has always been a fixed preemphasis.
$$ \beta = \frac{E[x(i)\,x(i-1)]}{E[x^2(i)]}, \qquad 0.5 \le \beta \le 1.0, \tag{28} $$


where $E[\cdot]$ signifies the running average of the past history, which is on the order of 1 s. The
theoretical range of $\beta$ is -1.0 to 1.0. We, however, intentionally limit its lower bound to 0.5
because if $\beta$ is less than 0.5, the speech signal has strong high-frequency components (i.e.,
unvoiced fricatives /s/, /sh/, /ch/, etc.); hence, no further preemphasis is needed. Figure 26 shows
the frequency response of the adaptive preemphasizer for various values of the preemphasis factor.
Since $\beta$ is derived through long-term averaging, it depends more on the speaker's vocal timbre
than on the spoken words (whose influence has been smudged by long-term averaging).
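
The adaptive preemphasis described above can be sketched in a few lines. This is a minimal illustration, not the report's implementation. It assumes that the first-order predictor coefficient is estimated as the ratio of running averages of x(i)x(i-1) and x(i)^2, that the running average is a one-pole (exponential) average with a time constant of about 1 s, and that the sampling rate is 8 kHz. The names `adaptive_preemphasis` and `tone` are hypothetical.

```python
import math


def adaptive_preemphasis(x, sample_rate_hz=8000, averaging_s=1.0,
                         beta_min=0.5, beta_max=1.0):
    """Apply y(i) = x(i) - beta * x(i-1), with beta tracked from the input.

    Returns (y, betas), where betas[i] is the factor used for sample i.
    """
    # Assumption: exponential running average with a time constant of about
    # averaging_s seconds; the report only says "on the order of 1 s".
    alpha = 1.0 / (averaging_s * sample_rate_hz)
    r0 = 0.0          # running average of x(i)^2
    r1 = 0.0          # running average of x(i) * x(i-1)
    prev = 0.0        # x(i-1)
    beta = beta_min   # assumed starting value before any signal is seen
    y, betas = [], []
    for xi in x:
        r0 += alpha * (xi * xi - r0)
        r1 += alpha * (xi * prev - r1)
        if r0 > 0.0:
            # First-order predictor coefficient, limited to [beta_min, beta_max]
            beta = min(beta_max, max(beta_min, r1 / r0))
        y.append(xi - beta * prev)
        betas.append(beta)
        prev = xi
    return y, betas


def tone(freq_hz, seconds=2.0, sample_rate_hz=8000):
    """Test signal: a unit-amplitude sine wave."""
    n = int(seconds * sample_rate_hz)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]
```

With these assumptions, a signal dominated by low frequencies (for example `tone(200.0)`) drives the factor toward 1.0 and receives a strong high-frequency boost. A signal dominated by high frequencies (for example `tone(3000.0)`) is held at the lower limit of 0.5.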
Fig. 26 - Adaptive preemphasis filter responses for various preemphasis factors ($\beta$) from
0.5 to 0.9 at an increment of 0.1. For comparison purposes, all the frequency responses
are normalized to have a unity gain at 1000 Hz. A voice with strong high-frequency
components (Fig. 25(a)) achieves a smaller $\beta$ value of about 0.77; therefore high frequencies
are not boosted as much. On the other hand, a voice without strong high-frequency components
(Fig. 25(b)) attains a larger $\beta$ value of 0.907; therefore high frequencies are boosted more
than for the other voice. Thus both speech samples have more balanced lower and upper
band spectral distributions.
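
The normalized responses described in the caption of Fig. 26 can be reproduced numerically from the single-zero filter H(z) = 1 - beta z^-1. The sketch below is illustrative only: the function name `preemphasis_gain_db` is hypothetical, and the 8 kHz sampling rate is an assumption.

```python
import cmath
import math


def preemphasis_gain_db(beta, freq_hz, sample_rate_hz=8000.0, ref_hz=1000.0):
    """Gain in dB of H(z) = 1 - beta * z^-1 at freq_hz, relative to ref_hz.

    sample_rate_hz = 8000 is an assumed sampling rate. Valid for beta < 1;
    with beta = 1 the gain at 0 Hz is zero and the logarithm is undefined.
    """
    def magnitude(f):
        w = 2.0 * math.pi * f / sample_rate_hz  # normalized radian frequency
        return abs(1.0 - beta * cmath.exp(-1j * w))

    return 20.0 * math.log10(magnitude(freq_hz) / magnitude(ref_hz))


if __name__ == "__main__":
    # Tabulate the relative gain at a few frequencies for beta = 0.5 ... 0.9
    for beta in (0.5, 0.6, 0.7, 0.8, 0.9):
        row = [round(preemphasis_gain_db(beta, f), 1) for f in (100.0, 1000.0, 3000.0)]
        print(beta, row)
```

Under these assumptions, a larger factor gives more attenuation below 1000 Hz and more boost above it, which matches the trend the caption describes.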


CONCLUSIONS 


In  this  report,  we  discuss  how  to  preprocess  the  speech  signal  in  such  a  way  that  the  subsequent 
digital  voice  processing  algorithm  can  function  at  its  best.  The  approach  presented  here  is  applicable 
to speech coding, speech recognition, speaker recognition, or any other processor used to extract
verbal or nonverbal information from speech.
A preprocessor is no longer a fixed-gain amplifier with an antialiasing filter. It is an adaptive
system that can self-adjust the speech level and remove any interference (i.e., DC bias, 60 Hz hum,
digital noise, ambient noise) if present. It can equalize a nonflat microphone response. The
preprocessor also has an antialiasing filter that has an excellent roll-off characteristic (-150 dB
per octave) with differential group delays of zero anywhere within the passband. The preprocessor
even equalizes wide variations of speech spectral tilt to improve the quality of the extracted speech
parameters. More importantly, the only analog circuits we use are a variable-gain amplifier at the
front end and the A/D converter. Since no elaborate analog circuits are involved, the preprocessor is
not a hindrance to hardware miniaturization.

This report is the result of our continuing effort to make voice processors more reliable and
able to operate successfully in difficult real-world environments.